9 Empirical Distribution

1 Convergence of Empirical Distribution

Suppose $X_1,\dots,X_n \overset{\text{i.i.d.}}{\sim} F$, where $F(x)=P(X\le x)$ is an unknown c.d.f. We want to estimate $F:\mathbb{R}\to[0,1]$.
A natural estimator is the empirical distribution $\hat F_n:\mathbb{R}\times\Omega\to[0,1]$:
$$\hat F_n(x)=\frac{1}{n}\sum_{i=1}^n I(X_i\le x),$$
where for $\omega\in\Omega$,
$$I(X_i\le x)(\omega)=\begin{cases}1, & X_i(\omega)\le x,\\ 0, & \text{otherwise.}\end{cases}$$

  1. $X_i(\omega)\le x \iff \omega\in X_i^{-1}((-\infty,x])$.
  2. Note that $E[I(X_i\le x)]=P(I(X_i\le x)=1)=P(\{\omega\in\Omega \mid X_i(\omega)\le x\})=P(X_i\le x)=F(x)$.
    So by the SLLN, $\forall x\in\mathbb{R}$, $\hat F_n(x)\xrightarrow{\text{a.s.}}F(x)$. I.e. $\forall x\in\mathbb{R}$, $P(\lim_{n\to\infty}\hat F_n(x)=F(x))=1$.
    If we expand the limit claim: $\forall\varepsilon>0$, $\exists N(x,\omega,\varepsilon)$ s.t. $\forall n\ge N(x,\omega,\varepsilon)$, $|\hat F_n(x,\omega)-F(x)|<\varepsilon$. Here $N$ depends on $x$, so this is pointwise convergence.
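As a quick illustration of the estimator and its pointwise convergence, here is a small simulation sketch. Taking $F$ to be the standard normal c.d.f. is purely an assumed example, and the function name `ecdf` is introduced here for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def ecdf(sample, x):
    """Empirical c.d.f.: F_hat_n(x) = (1/n) * #{i : X_i <= x}."""
    return np.mean(sample <= x)

# At a fixed x, F_hat_n(x) should approach F(x) as n grows (SLLN).
# Here F = standard normal, so F(0.5) ~= 0.6915.
x = 0.5
for n in [100, 10_000, 1_000_000]:
    sample = rng.standard_normal(n)
    print(n, ecdf(sample, x))
```

The printed values should stabilize near $\Phi(0.5)\approx 0.6915$ as $n$ grows, matching the SLLN claim above.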

One can obtain a stronger result:

Theorem (Glivenko-Cantelli)

Suppose $X_1,\dots,X_n\overset{\text{i.i.d.}}{\sim}F$. Then $\sup_x|\hat F_n(x)-F(x)|\xrightarrow{\text{a.s.}}0$. In other words, $P\left(\lim_{n\to\infty}\sup_x|\hat F_n(x)-F(x)|=0\right)=1$.

If we also expand it: $\forall\varepsilon>0$, $\exists N(\omega,\varepsilon)$ s.t. $\forall n\ge N(\omega,\varepsilon)$, $|\hat F_n(x,\omega)-F(x)|<\varepsilon$ for all $x\in\mathbb{R}$. Here $N$ does not depend on $x$, so this is uniform convergence.

The following proof and discussion are inserted from later notes; readers can skip this part for now.

Define $D_n=\sup_x|\hat F_n(x)-F(x)|$.
The theorem is then equivalent to $D_n\xrightarrow{\text{a.s.}}0$. However, today we only prove a weaker version: $D_n\xrightarrow{p}0$.
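A sketch of how $D_n$ can be computed in practice: $\hat F_n$ only jumps at the order statistics $X_{(1)}\le\dots\le X_{(n)}$, so for continuous $F$ the supremum is attained by comparing $F(X_{(i)})$ against $i/n$ and $(i-1)/n$. The uniform samples (so $F(x)=x$) are an assumed illustration, and `sup_distance` is a name introduced here:

```python
import numpy as np

def sup_distance(sample, F):
    """D_n = sup_x |F_hat_n(x) - F(x)| for a continuous c.d.f. F.
    The sup is attained at the order statistics, so it suffices to
    compare F(X_(i)) against i/n and (i-1)/n."""
    xs = np.sort(sample)
    n = len(xs)
    u = F(xs)
    i = np.arange(1, n + 1)
    return max(np.max(i / n - u), np.max(u - (i - 1) / n))

# Assumed example: F uniform on (0,1), so F(x) = x.
# D_n should shrink (at rate ~ 1/sqrt(n), as seen later).
rng = np.random.default_rng(0)
for n in [100, 10_000]:
    print(n, sup_distance(rng.uniform(size=n), lambda x: x))
```

For a single observation at $0.5$ with $F(x)=x$, the formula gives $D_1=\max(1-0.5,\,0.5-0)=0.5$, which is a handy sanity check.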

Lemma 1

Suppose $X_1,\dots,X_n\overset{\text{i.i.d.}}{\sim}F$ with $F$ continuous. Then the distribution of $D_n$ is the same for all continuous $F$. (Intuition: by the probability integral transform, $U_i=F(X_i)\sim\mathrm{Unif}(0,1)$, and $D_n$ can be rewritten in terms of the $U_i$ alone.)
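A simulation sketch of Lemma 1: replications of $D_n$ drawn under two different continuous distributions should look like samples from one common distribution. The standard normal and $\mathrm{Exp}(1)$ choices below, and the names `D_n`, `Phi`, `Exp1`, are assumptions of this sketch:

```python
import numpy as np
from math import erf, sqrt, exp

rng = np.random.default_rng(0)

def D_n(sample, F):
    """D_n = sup_x |F_hat_n(x) - F(x)|, evaluated at the order statistics."""
    xs = np.sort(sample)
    n = len(xs)
    u = np.array([F(x) for x in xs])
    i = np.arange(1, n + 1)
    return max((i / n - u).max(), (u - (i - 1) / n).max())

def Phi(x):   # standard normal c.d.f.
    return 0.5 * (1 + erf(x / sqrt(2)))

def Exp1(x):  # Exp(1) c.d.f.
    return 1 - exp(-x)

# Many replications of D_n under each F; by Lemma 1 the two collections
# of D_n values share one distribution, so e.g. their means agree.
n, reps = 50, 2000
d_norm = [D_n(rng.standard_normal(n), Phi) for _ in range(reps)]
d_exp = [D_n(rng.exponential(size=n), Exp1) for _ in range(reps)]
print(np.mean(d_norm), np.mean(d_exp))
```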

2 Relation to Brownian Bridge Kernel

Recall the multivariate CLT (here applied to $U_1,\dots,U_n\overset{\text{i.i.d.}}{\sim}\mathrm{Unif}(0,1)$, as Lemma 1 allows):
$$\sqrt{n}\left(\begin{bmatrix}\hat F_n(u_1)\\\vdots\\\hat F_n(u_k)\end{bmatrix}-\begin{bmatrix}u_1\\\vdots\\u_k\end{bmatrix}\right)=\frac{1}{\sqrt n}\sum_{a=1}^n\left(\begin{bmatrix}1\{U_a\le u_1\}\\\vdots\\1\{U_a\le u_k\}\end{bmatrix}-\begin{bmatrix}u_1\\\vdots\\u_k\end{bmatrix}\right)\xrightarrow{d}N_k(0,\Sigma),$$
where $\Sigma=(\Sigma_{ij})_{i,j=1,\dots,k}$ with
$$\Sigma_{ij}=\mathrm{Cov}(1\{U\le u_i\},1\{U\le u_j\})=E[1\{U\le u_i\}1\{U\le u_j\}]-E[1\{U\le u_i\}]E[1\{U\le u_j\}]=\min\{u_i,u_j\}-u_iu_j,$$
since $E[1\{U\le u_i\}1\{U\le u_j\}]=P(U\le u_i,U\le u_j)=\min\{u_i,u_j\}$ and $E[1\{U\le u_i\}]=P(U\le u_i)=u_i$.
This is true for all $k$. So the limit corresponds to a Gaussian process with the Brownian bridge kernel, the Brownian bridge $\{B^{\mathrm{br}}(u),\,u\in(0,1)\}$.
Hence $\sqrt{n}\,D_n=\sup_{u\in(0,1)}\sqrt{n}\,|\hat F_n(u)-u|\xrightarrow{d}\sup_{u\in(0,1)}|B^{\mathrm{br}}(u)|$.
We have another fact about the Brownian bridge:

Theorem (Kolmogorov-Smirnov)

$$P\left(\sup_{u\in(0,1)}|B^{\mathrm{br}}(u)|>x\right)=2\sum_{k=1}^{\infty}(-1)^{k+1}e^{-2k^2x^2}.$$

The first term $2e^{-2x^2}$ alone is very accurate.
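The series can be checked numerically; the sketch below (the name `ks_tail` is mine) compares the truncated series against the one-term approximation $2e^{-2x^2}$:

```python
from math import exp

def ks_tail(x, terms=100):
    """P(sup_u |B_br(u)| > x) = 2 * sum_{k>=1} (-1)^{k+1} exp(-2 k^2 x^2),
    truncated after `terms` terms (it converges extremely fast)."""
    return 2 * sum((-1) ** (k + 1) * exp(-2 * k * k * x * x)
                   for k in range(1, terms + 1))

# Full series vs. the one-term approximation 2*exp(-2 x^2):
for x in [1.0, 1.36, 2.0]:
    print(x, ks_tail(x), 2 * exp(-2 * x * x))
```

At $x=1$ the one-term approximation is already within about $0.25\%$ of the full series, and the agreement only improves for larger $x$, which is why the remark above is justified.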

So if $n$ is large, $P\left(D_n>\frac{x}{\sqrt n}\right)\approx 2e^{-2x^2}$.
This can be used to find an asymptotic level-$\alpha$ confidence band for estimating $F(u)$ simultaneously for all $u$: set $2e^{-2x^2}=\alpha$, so $x=\sqrt{\frac{1}{2}\ln\frac{2}{\alpha}}$ and the band half-width is $\frac{x}{\sqrt n}=\sqrt{\frac{1}{2n}\ln\frac{2}{\alpha}}$.
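The half-width formula can be computed directly; the values $\alpha=0.05$, $n=1000$ and the name `ks_band_halfwidth` below are assumed for illustration:

```python
from math import log, sqrt

def ks_band_halfwidth(n, alpha):
    """Half-width x/sqrt(n) of the asymptotic level-alpha simultaneous
    band F_hat_n(u) +/- x/sqrt(n), solving 2*exp(-2 x^2) = alpha."""
    return sqrt(log(2 / alpha) / (2 * n))

# With alpha = 0.05 this recovers x ~= 1.358, the familiar
# Kolmogorov-Smirnov critical value; for n = 1000 the band
# half-width is about 0.043.
print(ks_band_halfwidth(1000, 0.05))
```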